To all whom it may concern: 

Be it known that Jeremy Francis Taylor, Sara L.F. Davis, Luke Lind, Scott K. Davis 



have invented certain new and useful improvements in 

METHOD FOR ASSIGNING AN INDIVIDUAL TO A POPULATION OF ORIGIN BASED ON 
MULTI-LOCUS GENOTYPES 

of which the following is a full, clear and exact description. 

METHOD FOR ASSIGNING AN INDIVIDUAL TO A POPULATION OF 
ORIGIN BASED ON MULTI-LOCUS GENOTYPES 

Background Of The Invention 

Throughout this application, various publications are 
referenced in parentheses by author and year. Full 
citations for these references may be found at the 
end of the specification immediately preceding the 
claims. The disclosures of these publications in 
their entireties are hereby incorporated by reference 
into this application to more fully describe the 
state of the art to which this invention pertains. 



The assignment of an individual to a population of 
origin based upon the individual's multi-locus 
genotype is a statistical problem which must consider 
features of the genetic architecture of the 
underlying populations from which the individual may 
have originated. For example, if there exist 
population specific alleles at certain loci (the 
frequency of a population specific allele is zero in 
all but one of the populations), then the presence of 
at least one of these alleles in the genotype of an 
individual indicates unequivocally the population to 
which the individual belongs. Unfortunately, it is 
often difficult to establish that certain alleles are 
population specific, since their absence in a sample 
of individuals from any one population may be either 
because the alleles are truly population specific, or 
because the frequencies of these alleles are low and 
the sample obtained from any given population was 
small. Clearly, the absence of an allele in a sample 
from a population does not justify the assumption 
that the allele is not present in the population. 

In the absence of a definitive marker for population 
of origin, or in the case where a genotype 
potentially exists in more than one population, 
statistical approaches must be employed to identify 
the most likely population of origin from among a set 
of "candidate populations." Further, these approaches 
must also evaluate the strength of evidence for the 
individual belonging to the most likely population of 
origin over the other competing candidate 
populations. Finally, the strength of evidence 
supporting the individual belonging to the most 
likely population of origin against another novel 
population that is not represented among the set of 
candidate populations must also be evaluated. 

The present application discloses a statistical model 
for the assignment of individuals to a population of 
origin that possesses the following features: 

1. The approach assumes that samples of individuals 
are available from a number of candidate populations 
and that these individuals have been genotyped for a 
number of marker loci. 

2. There may be any number of candidate populations 
and each population may have a different sample size. 

3. There may be any number of markers that have 
been genotyped in the individuals within each of the 
candidate populations. 

4. The individual to be assigned to a population of 
origin may have been genotyped for all, or only a 
subset of the marker loci. 

5. Marker loci genotypes in each candidate 
population are tested for conformance to Hardy- 
Weinberg Equilibrium (HWE) and Gametic Phase 
Equilibrium (GPE) expectations. 

6. Under the null hypothesis that an individual 
belongs to any one given candidate population, the 
probability of the multi-locus genotype is computed 
for that population. 

7. The posterior probability of the individual 
belonging to each of the candidate populations is 
then calculated utilizing any available prior 
knowledge concerning the population of origin. 

8. The most likely population of origin of the 
tested individual is that population which possesses 
the greatest posterior probability of origin. It is 
recommended that an individual be assigned to that 
population only when the posterior probability of 
origin exceeds a threshold, such as 80%. 






9. The percentage of genotypes more rare than the 
genotype of the individual in the most likely 
population of origin can be calculated or simulated 
in order to ascertain whether the individual may 
actually belong to a novel population not included in 
the set of candidate populations. 

This model has application for example in the 
livestock industry for assigning an individual animal 
to a breed or to a population based on a desirable 
trait such as animal growth, quality grade, yield 
grade, marbling, rib-eye muscle area, dressing 
percentage, or meat tenderness. 

Summary Of The Invention 

The present invention provides a method of assigning 
an individual to a population of origin, which 
comprises: 

(a) identifying a set of candidate populations of 
origin, wherein each candidate population is 
characterized by genotype frequencies and allele 
frequencies at one or more marker loci; 

(b) determining a population prior genotype 
probability for each individual and candidate 
population of origin using knowledge concerning 
the individual which is available prior to 
genotyping the individual; 

(c) genotyping the individual to identify the 
alleles at one or more of the marker loci 
identified in step (a) to thereby identify the 
individual's genotype; 

(d) based on the identified genotype of the 
individual, sequentially determining a 
population genotype probability for each 
candidate population of origin under a null 
hypothesis that the individual arose from the 
population; 

(e) combining the population prior genotype 
probability from step (b) and the population 
genotype probability from step (d) to obtain a 
population posterior genotype probability for 
each candidate population of origin; 

(f) identifying a most likely population of origin 
wherein the population has the largest posterior 
genotype probability among the set of candidate 
populations; and 

(g) assigning the individual to the population 
identified in step (f). 

Description Of The Invention 



The following definitions are presented as an aid in 
understanding this invention. 

As used herein a marker locus is defined as a unique 
location on a chromosome (locus) within the nuclear 
genome of an individual, at which variation among 
chromosomes and individuals may be detected. 
Examples include but are not limited to 
microsatellite, Restriction Fragment Length 
Polymorphism (RFLP), Random Amplified Polymorphic DNA 
(RAPD), Variable Number of Tandem Repeat (VNTR), and 
Single Nucleotide Polymorphism (SNP) loci. Marker 
loci are usually named, and the name expressed in 
italics. For example, AGLA17 is a microsatellite 
locus located at the centromeric end of chromosome 1 
in cattle. 

An allele is a genetic variant at a marker locus 
detected on a single chromosome. For example, for the 
A locus there may be n possible alleles and each 
allele is individually designated as $A_1, A_2, \ldots, A_n$. 

The allele frequency is the frequency of an allele $A_i$ 
at the A locus within a specific population and is 
defined as $F(A_i) = p_{A_i}$ such that $\sum_{i=1}^{n} p_{A_i} = 1$. 



Diploid means the nuclear genome of the individual 
possesses pairs of chromosomes, in which one 
chromosome of each pair is transmitted by each 
parent. Without loss of generality, the methodology 
described here will be for diploid species. 

Genotype is defined as the combination of alleles at 
a single locus that is found within an individual. 
Genotypes at the A locus are of the form $A_iA_j$ for i 
and j between 1 and n. Individuals possessing two 
identical alleles $A_iA_i$ are called homozygotes and 
individuals possessing two different alleles ($i \neq j$) 
heterozygotes. Similarly a multi-locus genotype is 
represented as the genotypes at each locus, e.g., 
$A_1A_2B_3B_3C_4C_5$. 



Genotyping an individual means to analyze a sample of 
deoxyribonucleic acid (DNA) from the individual to 
identify the alleles present at one or more marker 
loci. 

A haplotype is defined to be the set of alleles at 
multiple loci that are present in a gamete (sperm or 
ova). If there are $n_a$, $n_b$ and $n_c$ alleles present at 
the A, B and C loci, haplotypes are represented as 
$A_iB_jC_k$ for $i = 1, \ldots, n_a$; $j = 1, \ldots, n_b$ and $k = 1, \ldots, n_c$. 

Hardy-Weinberg Equilibrium (HWE) means that in a 
random mating population in which there is no 
selection, migration, mutation or drift, population 
genotype frequencies occur as a simple function of 
allele frequencies. Among homozygotes $F(A_iA_i) = p_{A_i}^2$ 
and among heterozygotes $F(A_iA_j) = 2p_{A_i}p_{A_j}$. 



Gametic Phase Equilibrium (GPE): Two (or more) loci 
are defined as being in GPE if all population 
haplotype frequencies occur as the product of 
individual allele frequencies, viz. $F(A_iB_jC_k) = p_{A_i}p_{B_j}p_{C_k}$ 
for $i = 1, \ldots, n_a$; $j = 1, \ldots, n_b$ and $k = 1, \ldots, n_c$. For loci 
that are in GPE, individual loci are in HWE, and 
multi-locus genotype frequencies are obtained as the 
product of individual locus genotype frequencies. For 
example, $F(A_1A_2B_3B_3C_4C_5) = F(A_1A_2)F(B_3B_3)F(C_4C_5) = 
(2p_{A_1}p_{A_2})(p_{B_3}^2)(2p_{C_4}p_{C_5}) = 4p_{A_1}p_{A_2}p_{B_3}^2p_{C_4}p_{C_5}$. 



A candidate population is a population from which a 
sample of individuals has been genotyped for multiple 
marker loci and sample allele frequencies have been 
determined for each locus. 



Having due regard to the preceding definitions, the 
present invention concerns a method of assigning an 
individual to a population of origin, which 
comprises: 

(a) identifying a set of candidate populations of 
origin, wherein each candidate population is 
characterized by genotype frequencies and allele 
frequencies at one or more marker loci; 

(b) determining a population prior genotype 
probability for each individual and candidate 
population of origin using knowledge concerning 
the individual which is available prior to 
genotyping the individual; 

(c) genotyping the individual to identify the 
alleles at one or more of the marker loci 
identified in step (a) to thereby identify the 
individual's genotype; 

(d) based on the identified genotype of the 
individual, sequentially determining a 
population genotype probability for each 
candidate population of origin under a null 
hypothesis that the individual arose from the 
population; 

(e) combining the population prior genotype 
probability from step (b) and the population 
genotype probability from step (d) to obtain a 
population posterior genotype probability for 
each candidate population of origin; 

(f) identifying a most likely population of origin 
wherein the population has the largest posterior 
genotype probability among the set of candidate 
populations; and 

(g) assigning the individual to the population 
identified in step (f). 



In one embodiment of the method, the individual is 
only assigned to the most likely population of origin 
if the posterior genotype probability for the most 
likely population of origin exceeds a threshold 
value. In one embodiment, the threshold value is 
determined empirically. In one embodiment, the 
threshold value is determined using a sample of 
individuals from each candidate population who are 
independent of individuals used to characterize each 
candidate population. In one embodiment, the 
threshold value is varied to determine the percentage 
of individuals who a) cannot be classified to a 
population of origin, b) are correctly classified, 
and c) are incorrectly classified. 

In one embodiment, the method further comprises: 

(a) computing a probability with which genotypes 
rarer than the individual's genotype occur in 
the most likely population of origin; and 

(b) if the probability in step (a) is above a 
threshold value, assigning the individual to the 
population of origin previously identified as 
the most likely population of origin, or if the 
probability in step (a) is not above a threshold 
value, assigning the individual to a novel 
population that is not represented among the set 
of candidate populations of origin. 

In one embodiment, the threshold value is determined 
empirically. In one embodiment, the threshold value 
is determined using a sample of individuals from each 
candidate population who are independent of 
individuals used to characterize each candidate 
population. In one embodiment, the threshold value 
is varied to reduce the percentage of individuals who 
are incorrectly classified to a population. 

In one embodiment of the method, the population prior 
genotype probability is based on one or more 
morphological features of the individual. In a 
further embodiment, one or more morphological 
features allow the exclusion of one or more candidate 
populations of origin. In different embodiments, one 
or more morphological features are selected from the 
group consisting of coat color, presence or absence 
of horns, and presence or absence of Bos indicus 
(humped or Zebu cattle) features such as a shoulder 
hump or a long, downswept ear. In a further 
embodiment, the coat color is black or nonblack. 

In one embodiment, the population prior genotype 
probability is set to equal a proportion of total 
population size that comprises each candidate 
population of origin. In another embodiment, the 
population prior genotype probability is assumed to 
be uniform for each candidate population of origin. 

In one embodiment of the method, the marker locus 
genotypes for each candidate population of origin are 
in Hardy-Weinberg Equilibrium and Gametic Phase 
Equilibrium. In other embodiments, the marker locus 
genotypes for each candidate population of origin are 
not in Hardy-Weinberg Equilibrium or Gametic Phase 
Equilibrium. 

In one embodiment of the method, the individual is an 
animal. In further embodiments, the animal is a cow, 
a heifer, a steer, a bull, a bullock, a pig, a horse, 
a fish, a chicken, a duck, a lamb, a shrimp, an 
oyster, a mussel, or a shellfish. 

In one embodiment of the method, the candidate 
population of origin is selected based on a desirable 
trait. In further embodiments, the desirable trait 
is selected from the group consisting of one or more 
of animal growth, quality grade, yield grade, 
marbling, rib-eye muscle area, dressing percentage, 
meat tenderness, meat flavor, meat palatability, 
fatness, fat color, unsaturated fatty acid content of 
fat, reproductive efficiency, prolificacy, disease 
resistance, feed conversion efficiency, drought 
tolerance, and heat tolerance. Marbling score in 
beef cattle is a subjective score assigned by a 
United States Department of Agriculture (USDA) grader 
to a carcass based upon the amount of intramuscular 
fat visualized in the longissimus dorsi muscle at the 
12th to 13th rib juncture in properly chilled 
carcasses (United States Standards for Grades of 
Carcass Beef, 1997). Ribeye muscle area is the 
cross-sectional area of the longissimus dorsi muscle 
at the 11th to 12th rib juncture and is measured 
subjectively or by means of a grid calibrated in 
tenths of an inch at the same time as the marbling 
score is obtained. Quality grade is assigned by the 
USDA grader and is a combination of the marbling 
score and maturity (age) of the animal estimated from 
the size, shape and ossification of the bones and 
cartilages (especially the split chine bones) and the 
color and texture of the flesh. Younger animals (A 
maturity) are not penalized, but older animals (B 
maturity) have their marbling scores down rated into 
the quality grade. Yield grade is assigned by the 
USDA grader and is an estimate of the yield of 
closely trimmed (1/2 inch fat or less), boneless 
retail cuts expected to be derived from the major 
wholesale cuts (round, sirloin, short loin, rib, and 
square-cut chuck) of a carcass. The yield grade of a 
beef carcass is determined by considering four 
characteristics: the amount of external fat; the 
amount of kidney, pelvic and heart fat; the area of 
ribeye muscle; and the carcass weight. Carcasses 
possessing large amounts of exterior and interior fat 
receive larger yield grade scores indicating lower 
yields of lean meat. Dressing percentage is the 
ratio of hot carcass weight (the eviscerated carcass) 
to live animal weight immediately preslaughter and 
expressed as a percentage. 

In one embodiment of the method, the candidate 
population of origin is selected based on an 
undesirable trait. In a further embodiment, the 
undesirable trait is toughness of meat. 

This invention will be better understood from the 
methodology and examples which follow. However, one 
skilled in the art will readily appreciate that the 
specific methods and examples discussed are merely 
illustrative of the invention as described more fully 
in the claims which follow thereafter. 



Methodology 



Candidate Population Data 

Baseline data are gathered for each of the 
populations that are going to be candidates for the 
assignment of individuals. This should represent all 
of the known extant populations. The process involves 
sampling individuals from each population and 
genotyping them for the marker loci that are to be 
used in the classification process. The larger the 
number of individuals in each sample the better; 50 
individuals is a "reasonable" target. Fewer 
individuals will sometimes be necessary for small 
populations. 

The data that are collected on the individuals to 
characterize the candidate populations are used to: 
a) estimate allele frequencies in the candidate 
population, b) estimate genotype frequencies in the 
candidate population, and c) test to determine if the 
marker loci are in Hardy-Weinberg Equilibrium and 
Gametic Phase Equilibrium in each of the candidate 
populations. The allele frequencies are estimated by 
counting the number of alleles of each type that are 
present in the sample and expressing the totals as a 
proportion of the total number of alleles in the 
sample. Similarly, the genotype frequencies for each 
marker locus are estimated by counting the number of 
genotypes of each type that are present in the sample 
and expressing the totals as a proportion of the 
total number of genotypes in the sample. 

The numbers of alleles present in the sample from 
each population are tabulated. Let $N^i$ be the number 
of alleles present in the sample of individuals which 
are known with certainty as having originated from 
the $i$th candidate population, $i = 1, \ldots, p$. In a diploid 
species, $N^i$ is twice the number of individuals in the 
sample and the sample size may vary among the p 
populations. Suppose that a series of m marker loci 
are genotyped in all individuals and populations. 
Within each population, the resulting genotype counts 
may be tested for Hardy-Weinberg Equilibrium (HWE) 
and Gametic Phase Equilibrium (GPE) using the well 
known likelihood ratio or $\chi^2$ "goodness of fit" tests, 
as described for example in Weir (1996). Without loss 
of generality, we assume that each of the candidate 
populations is found to be in HWE and GPE. Further, 
the number of alleles present in the sample for each 
marker locus and each population is tabulated as 
follows. Let $n^i_{A_1}$ be the number of $A_1$ alleles detected 
in the sample from the $i$th population. The data for 
locus A (and similarly for the remaining marker loci) 
may be represented as: 

Allele          Allele Count 
$A_1$           $n^i_{A_1}$ 
$A_2$           $n^i_{A_2}$ 
$A_3$           $n^i_{A_3}$ 
...             ... 
$A_{n_a}$       $n^i_{A_{n_a}}$ 
Total           $N^i$ 

Note that certain of the $n^i_{A_j} = 0$ if the $j$th allele at 
the A locus is not detected in the sample from the $i$th 
population. 
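
The $\chi^2$ goodness-of-fit test for HWE mentioned above can be sketched in Python as follows. This is a minimal illustration and not the procedure of Weir (1996) verbatim: the function name `hwe_chi_square`, the dictionary layout of the genotype counts, and the choice to return only the statistic and its degrees of freedom (without a p-value) are assumptions made for this example.

```python
from itertools import combinations_with_replacement


def hwe_chi_square(genotype_counts):
    """Chi-square goodness-of-fit statistic for HWE at one locus.

    genotype_counts maps an allele pair, e.g. ("A1", "A2"), to the number
    of sampled individuals carrying that genotype (hypothetical layout).
    Returns (statistic, degrees_of_freedom).
    """
    n_individuals = sum(genotype_counts.values())

    # Count alleles: every individual contributes two alleles.
    allele_totals = {}
    for (a, b), count in genotype_counts.items():
        allele_totals[a] = allele_totals.get(a, 0) + count
        allele_totals[b] = allele_totals.get(b, 0) + count
    freqs = {a: t / (2 * n_individuals) for a, t in allele_totals.items()}

    statistic = 0.0
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        # Expected count under HWE: p^2 for homozygotes, 2pq for heterozygotes.
        if a == b:
            expected = n_individuals * freqs[a] ** 2
            observed = genotype_counts.get((a, a), 0)
        else:
            expected = n_individuals * 2 * (freqs[a] * freqs[b])
            observed = genotype_counts.get((a, b), 0) + genotype_counts.get((b, a), 0)
        statistic += (observed - expected) ** 2 / expected

    k = len(freqs)
    # k(k+1)/2 genotype classes, minus 1, minus (k-1) estimated frequencies.
    return statistic, k * (k - 1) // 2
```

The statistic would then be compared against a $\chi^2$ distribution with the returned degrees of freedom; that lookup is omitted here to keep the sketch free of external libraries.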

As an example, suppose that at the A locus we 
observe three alleles $A_1$, $A_2$ and $A_3$. The individuals 
are diploid, so genotype is defined by the 
combination of two alleles that are present in any 
one individual. Assume that when we genotype 120 
individuals from a given candidate population, we 
observe the following: 

Genotype        Number of individuals 
$A_1A_1$        22 
$A_1A_2$        12 
$A_1A_3$        8 
$A_2A_2$        40 
$A_2A_3$        16 
$A_3A_3$        22 
Total           120 



The genotype frequencies are obtained from the sample 
as the relative frequencies of the genotypes, so for 
the $A_1A_1$ genotype we have a genotype frequency of 
22/120 = 0.1833. To obtain the allele frequencies, 
we count the number of alleles present in the sample 
using the genotype counts above. So there are 22 $A_1A_1$ 
individuals with 2 $A_1$ alleles, 12 $A_1A_2$ individuals 
with 1 $A_1$ allele and 8 $A_1A_3$ individuals with 1 $A_1$ 
allele. This gives us a total of 64 $A_1$ alleles. 
Therefore 

Allele          Number of alleles 
$A_1$           64 
$A_2$           108 
$A_3$           68 
Total           240 

The total number of alleles is of course twice the 
total number of individuals. 
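
The counting in the example above can be reproduced with a short Python sketch. The function names and the use of integer labels 1, 2, 3 for the alleles $A_1$, $A_2$, $A_3$ are assumptions made for illustration only.

```python
from collections import Counter

# Genotype counts from the worked example; alleles A_1, A_2, A_3 are
# labeled 1, 2, 3 (a labeling chosen for this sketch).
genotype_counts = {
    (1, 1): 22, (1, 2): 12, (1, 3): 8,
    (2, 2): 40, (2, 3): 16, (3, 3): 22,
}


def genotype_frequencies(counts):
    """Relative frequency of each genotype in the sample."""
    total = sum(counts.values())
    return {genotype: c / total for genotype, c in counts.items()}


def allele_counts(counts):
    """Number of copies of each allele; homozygotes contribute two copies."""
    tally = Counter()
    for (a, b), c in counts.items():
        tally[a] += c
        tally[b] += c
    return dict(tally)


def allele_frequencies(counts):
    """Allele counts expressed as a proportion of all alleles in the sample."""
    tally = allele_counts(counts)
    total = sum(tally.values())
    return {allele: c / total for allele, c in tally.items()}
```

Calling `allele_counts(genotype_counts)` gives 64, 108 and 68 copies of the three alleles, matching the table, and the total of 240 is twice the 120 individuals.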

Candidate Population Prior Genotype Probabilities 

For each individual to be tested, prior probabilities 
are assigned for the probability of belonging to each 
of the candidate populations. Prior probabilities 
are assigned based on knowledge that is available 
before the DNA sample from an individual was analyzed 
for marker genotype information. If there is no 
prior information then each population is assigned an 
equal prior probability of having given rise to the 
individual. Alternatively, certain morphological 
data may be available on an individual which allow 
the exclusion of certain of the candidate 
populations, in which case the prior probabilities of 
these populations for this individual are set to 
zero. If, for example, the individual is a horned 
animal and only three out of ten candidate 
populations contain horned animals, the prior 
probabilities for the seven non-horned populations 
would be set to zero and the prior probabilities for 
the three horned populations would each be set to 1/3 
in the absence of further information. 

Let $P_{ij}$ represent the a priori or prior probability 
that the $j$th individual originated from the $i$th 
population. If individuals are sampled at random 
with respect to population of origin, we should elect 
to set the population prior probabilities equal to 
the proportion of the total population size that 
comprises each candidate population. If no 
pre-existing information was available, we assume a 
non-informative or uniform prior of $P_{ij} = 1/p$ for 
$i = 1, \ldots, p$. Hence the individual had an equal chance of 
originating from any of the p candidate populations 
when a uniform prior is used. 

Prior probabilities may differ for each individual 
that is to be tested, but in every case must sum to 
unity as $\sum_{i=1}^{p} P_{ij} = 1$. 
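
The three ways of setting priors described above (uniform, proportional to population size, and exclusion based on morphology) can be sketched as follows. The function names and the dictionary representation are assumptions for illustration; the source does not prescribe a data structure.

```python
def uniform_prior(populations):
    """Non-informative prior: P_ij = 1/p for each of the p candidates."""
    p = len(populations)
    return {pop: 1.0 / p for pop in populations}


def proportional_prior(population_sizes):
    """Prior equal to each population's share of the total population size."""
    total = sum(population_sizes.values())
    return {pop: size / total for pop, size in population_sizes.items()}


def prior_with_exclusions(populations, excluded):
    """Zero prior for excluded populations, uniform over the remainder."""
    allowed = [pop for pop in populations if pop not in excluded]
    if not allowed:
        raise ValueError("all candidate populations were excluded")
    share = 1.0 / len(allowed)
    return {pop: (0.0 if pop in excluded else share) for pop in populations}
```

For the horned-animal example, excluding seven of ten candidate populations leaves a prior of 1/3 on each of the three remaining populations, and the priors still sum to one.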

Candidate Population Genotype Probabilities 

Each individual is genotyped for the marker loci for 
which baseline information was gathered for each 
candidate population. The individual's genotype 
probability is then estimated (using a maximum 
likelihood approach) for each of the candidate 
populations. 

Suppose that an individual that is to be assigned to 
a population is genotyped for the m marker loci and, 
arbitrarily, the multi-locus genotype is determined 
to be $A_1A_2B_3B_3 \ldots M_4M_5$. The probability of this genotype 
occurring in the $i$th population is determined as 
follows: 

1. Under the null hypothesis that the individual 
originated from the $i$th population, the individual may 
be incorporated into the sample data for this 
population and the allele counts at each locus 
updated. For example, at the A locus, the sample 
counts for the $i$th population become: 

Allele          Allele Count 
$A_1$           $n^i_{A_1} + 1$ 
$A_2$           $n^i_{A_2} + 1$ 
$A_3$           $n^i_{A_3}$ 
...             ... 
$A_{n_a}$       $n^i_{A_{n_a}}$ 
Total           $N^i + 2$ 

Similarly, at the B locus, the sample counts for the 
$i$th population become: 

Allele          Allele Count 
$B_1$           $n^i_{B_1}$ 
$B_2$           $n^i_{B_2}$ 
$B_3$           $n^i_{B_3} + 2$ 
...             ... 
$B_{n_b}$       $n^i_{B_{n_b}}$ 
Total           $N^i + 2$ 

2. Under the assumption of Hardy-Weinberg 
Equilibrium in the $i$th population, the maximum 
likelihood estimate (MLE) of genotype frequency at 
each of the marker loci is obtained. For example, at 
the A locus, the MLE of the probability of the $A_1A_2$ 
genotype $F_i(A_1A_2)$ is $2(n^i_{A_1}+1)(n^i_{A_2}+1)/(N^i+2)^2$. At the B 
locus, the MLE of the probability of the $B_3B_3$ genotype 
$F_i(B_3B_3)$ is $(n^i_{B_3}+2)^2/(N^i+2)^2$. 

3. Under the assumption of Gametic Phase 
Equilibrium in the $i$th population, the MLE of the 
multi-locus genotype frequency is obtained as the 
product of the genotype frequencies at each of the 
marker loci. Thus, the MLE of the probability of the 
$A_1A_2B_3B_3 \ldots M_4M_5$ genotype in the $i$th population 
$F_i(A_1A_2B_3B_3 \ldots M_4M_5)$ is: 

$\{2(n^i_{A_1}+1)(n^i_{A_2}+1)/(N^i+2)^2\}\{(n^i_{B_3}+2)^2/(N^i+2)^2\} \cdots \{2(n^i_{M_4}+1)(n^i_{M_5}+1)/(N^i+2)^2\}$. 



In general, let $G_j$ represent the multi-locus genotype 
of the $j$th individual. Then $F_i(G_j)$ represents the 
probability of the genotype of the $j$th individual in 
the $i$th population. 
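
Steps 1 to 3 above can be sketched in Python as follows. The function names and the nested-dictionary layout of the sample counts are assumptions made for this illustration. The sketch also assumes that the allele counts supplied for a locus cover the whole sample, so that their sum equals $N^i$. Loci that were not genotyped in the individual are simply left out of the `genotype` dictionary.

```python
def locus_probability(allele_counts, genotype):
    """HWE genotype probability at one locus after adding the individual.

    allele_counts: {allele: count} for the candidate population sample.
    genotype: pair of alleles carried by the tested individual.
    """
    a, b = genotype
    updated = dict(allele_counts)
    # Step 1: add the individual's two alleles to the sample counts.
    updated[a] = updated.get(a, 0) + 1
    updated[b] = updated.get(b, 0) + 1
    total = sum(updated.values())  # N^i + 2
    fa, fb = updated[a] / total, updated[b] / total
    # Step 2: HWE gives p^2 for homozygotes and 2pq for heterozygotes.
    return fa * fa if a == b else 2 * (fa * fb)


def multilocus_probability(sample_counts, genotype):
    """Step 3: under GPE, multiply the single-locus probabilities.

    sample_counts: {locus: {allele: count}}
    genotype: {locus: (allele, allele)} for the loci that were genotyped.
    """
    probability = 1.0
    for locus, pair in genotype.items():
        probability *= locus_probability(sample_counts[locus], pair)
    return probability
```

Because the individual's own alleles are added before the frequencies are computed, an allele that was never seen in a candidate sample receives a small positive frequency rather than zero, so no population is ruled out outright.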

If the candidate population is not in Hardy-Weinberg 
and Gametic Phase Equilibrium, the population 
genotype frequency is estimated by the frequency of 
the individual's genotype in the sample from each 
candidate population. This process involves 
sequentially adding the individual to the sample so 
that the frequency of any genotype that was not 
present in the original sample is 1/(N+1) where N is 
the number of individuals sampled for the population. 
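
A minimal sketch of this non-equilibrium estimate follows. The source states only that a genotype absent from the original sample receives frequency 1/(N+1); the general form (c+1)/(N+1) for a genotype seen c times is inferred here from "adding the individual to the sample" and should be read as an assumption. The function name and the representation of each multi-locus genotype as a hashable, consistently ordered tuple are also hypothetical.

```python
def empirical_genotype_frequency(sample_genotypes, genotype):
    """Frequency of a genotype after adding the tested individual.

    sample_genotypes: list of multi-locus genotypes observed in the
    candidate sample, each in the same normalized form as `genotype`.
    """
    matches = sum(1 for g in sample_genotypes if g == genotype)
    # The individual itself adds one observation and one to the sample size.
    return (matches + 1) / (len(sample_genotypes) + 1)
```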

Candidate Population Posterior Genotype Probabilities 

The posterior probabilities that the individual 
belongs to each of the candidate populations are 
determined, and the population with the largest 
probability is selected as the "most likely" 
population of origin. 

The posterior probability of the $j$th individual's 
genotype originating from the $i$th candidate population 
is obtained by combining both the population prior 
genotype probabilities and the candidate population 
genotype probabilities as follows: 

$O_{ij} = \frac{P_{ij} F_i(G_j)}{\sum_{k=1}^{p} P_{kj} F_k(G_j)}$ 

For the $j$th individual, $O_{ij}$ is computed for each 
population $i = 1, \ldots, p$ and the population with the 
greatest $O_{ij}$ value is the most likely population of 
origin among the set of candidate populations. 
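
The posterior calculation is an application of Bayes' rule and can be sketched as follows. The function names and the dictionary inputs keyed by population are assumptions for illustration.

```python
def posterior_probabilities(priors, genotype_probs):
    """Posterior O_ij for one individual across all candidate populations.

    priors: {population: P_ij}
    genotype_probs: {population: F_i(G_j)}
    """
    weighted = {pop: priors[pop] * genotype_probs[pop] for pop in priors}
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("genotype has zero probability in every population")
    return {pop: w / total for pop, w in weighted.items()}


def most_likely_population(posteriors):
    """Population with the greatest posterior probability."""
    return max(posteriors, key=posteriors.get)
```

With a uniform prior over two populations and genotype probabilities of 3e-6 and 1e-6, the posteriors are 0.75 and 0.25. Only the ratio of the genotype probabilities matters, which is why very small multi-locus probabilities can still give a decisive posterior.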

Simply taking the population with the largest 
posterior probability as being the "most likely" 
population of origin may not be a very good decision 
rule. The additional steps to the procedure 
described below are designed to help the user arrive 
at a decision rule that has quantified success and 
error rates. 

A threshold value must be determined for the 
posterior probability in order to define a decision 
rule for accepting an individual as originating in 
one of the candidate populations. For example, one 
may choose to accept an individual as originating in 
a population if $O_{ij}$ exceeds 0.90 for one population. 
This is interpreted to mean that among the available 
candidate populations, there is a 90% chance the 
individual originated from the most likely population 
and only a 10% chance of originating in any of the 
other populations. Individuals that are hybrids 
typically produce approximately equal posterior 
probabilities of belonging to the two populations 
that contributed the parents of the hybrid and thus 
these hybrid animals are generally not assigned to 
any one population. 

If a second set of samples of individuals from each 
of the populations is available or can be obtained 
that are independent of the samples used to produce 
the baseline candidate population data, these 
individuals can be genotyped and posterior 
probabilities can be calculated for each individual 
and each of the candidate populations. A decision 
rule can then be empirically determined that meets 
the requirements of the user. For example, the user 
may wish to ensure that 95% of the individuals that 
are assigned to a population are correctly assigned. 
Thus, one may find that assigning an individual to a 
population only when the posterior probability of 
belonging to that population is greater than 90% 
results in 95% of individuals being correctly 
assigned to their population of origin. By altering 
the threshold for the posterior probability decision 
rule, one can determine the proportion of individuals 
that are correctly classified, incorrectly classified 
and not classified respectively. Individuals for 
which the largest posterior probability falls below 
the threshold are not assigned to a population. 
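
The empirical tuning of the threshold can be sketched as follows. The function name, the layout of the validation data (pairs of known population and posterior dictionary), and the strict "greater than" comparison are assumptions for illustration. The validation posteriors shown in the test data are invented toy values, not results from the source.

```python
def classification_rates(validation, threshold):
    """Proportions correctly, incorrectly and not classified.

    validation: list of (true_population, {population: posterior}) for
    individuals independent of the baseline samples.
    """
    correct = incorrect = unassigned = 0
    for true_population, posteriors in validation:
        best = max(posteriors, key=posteriors.get)
        if posteriors[best] <= threshold:
            unassigned += 1  # largest posterior does not exceed the threshold
        elif best == true_population:
            correct += 1
        else:
            incorrect += 1
    n = len(validation)
    return {
        "correct": correct / n,
        "incorrect": incorrect / n,
        "unassigned": unassigned / n,
    }
```

Evaluating this function over a grid of thresholds (for example 0.50, 0.60, up to 0.99) shows the trade-off: raising the threshold tends to reduce incorrect assignments at the cost of leaving more individuals unassigned.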






Rarity of Genotype in Candidate Population 

Occasionally an individual may be incorrectly 
assigned to a population because the individual arose 
from a population that was not represented in the 
group of candidate populations. In this case, the 
procedure described above will identify the 
population that is most similar to the population 
from which the individual actually arose. If the 
posterior probability is greater than the threshold 
(i.e., all of the remaining populations are quite 
different to the population from which the individual 
actually arose), the individual will be incorrectly 
assigned. These cases of incorrect assignment can be 
identified by calculating the probability of a rarer 
genotype in the assigned population. If this 
probability is low, say 5% (this threshold is also 
user defined and can also be determined empirically), 
then one might reject the individual from the 
population and change the classification of the 
individual to "unassigned." The underlying logic here 
is that even though the individual shows strong 
evidence for belonging to only one of the 
populations, it is actually a rare genotype in that 
population. From a statistical perspective, it is 
more likely that the individual actually has a fairly 
common genotype in a population that is not 
represented among the candidate populations. 

If there are m marker loci and there are n k alleles at 
the k th marker, there will be T =|^ "*("' + 1 > possible 



multi-locus genotypes in any population. It does not 
require many loci or many alleles at the' individual 
marker loci for the total number of genotypes, T, to 
become very large. For example, with m = 10 marker 
loci and n k = 6 alleles at each locus (which is 
characteristic of microsatellite loci ) , there are 
more than a trillion possible genotypes. In this 
25 case, we estimate the frequency of a genotype that is 

rarer than the genotype present in the tested 
individual by simulation. A large number of multi- 
locus genotypes (such as 100,000) is simulated by 
drawing alleles at random from the most likely 
population using the relative allele frequencies for 
the population after adding the alleles of the 
individual to the sample. The probability of each of 
the simulated genotypes is then computed as described
above. Finally, the percentage of simulated genotypes
for which the multi-locus genotype frequency is lower 
than that of the tested individual is calculated. If 
the percentage of rarer genotypes is low, say 5% or 
less, we might conclude that the tested individual 
has a genotype that is too rare for it to truly have 
originated in the most likely population and that the 
individual actually belongs to a novel population. 
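The genotype count and the simulation described in this paragraph can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the specification: genotype frequencies at each locus are taken to follow Hardy-Weinberg proportions, loci are treated as independent, and the allele frequencies passed in are assumed to already include the alleles of the tested individual. All function names are hypothetical.

```python
import math
import random

def count_genotypes(alleles_per_locus):
    """Number of possible multi-locus genotypes: product of n_k (n_k + 1) / 2."""
    return math.prod(n * (n + 1) // 2 for n in alleles_per_locus)

def genotype_probability(genotype, freqs):
    """Probability of a multi-locus genotype.

    genotype: list of (allele, allele) pairs, one per locus.
    freqs: list of dicts mapping allele -> relative frequency, one per locus.
    ASSUMPTION: Hardy-Weinberg proportions and independent loci.
    """
    prob = 1.0
    for (a, b), locus in zip(genotype, freqs):
        prob *= locus[a] * locus[b] * (1 if a == b else 2)
    return prob

def simulate_genotype(freqs, rng):
    """Draw two alleles at random per locus according to the allele frequencies."""
    genotype = []
    for locus in freqs:
        alleles = list(locus)
        weights = [locus[x] for x in alleles]
        genotype.append(tuple(rng.choices(alleles, weights=weights, k=2)))
    return genotype

def fraction_rarer(genotype, freqs, n_sim=100_000, seed=0):
    """Share of simulated genotypes whose probability is lower than the tested one."""
    rng = random.Random(seed)
    observed = genotype_probability(genotype, freqs)
    rarer = sum(
        genotype_probability(simulate_genotype(freqs, rng), freqs) < observed
        for _ in range(n_sim)
    )
    return rarer / n_sim

print(count_genotypes([6] * 10))  # 21**10 = 16679880978201
```

With 10 loci of 6 alleles each there are 21 genotypes per locus, so the count is 21 to the power 10, roughly 1.67 x 10^13, which is why the distribution is sampled rather than enumerated.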

Advantageous Features of the Approach 

The approach described herein provides certain 
advantages including, but not limited to, the 
following:

1. The approach assumes that any allele that is 
present in an individual to be tested but that is 
absent from the sample for any one candidate 
population, is absent because it is a rare allele 
that was not captured in the sample rather than the 
allele being population specific. This approach 
loses statistical power in the sense that population 
specific alleles unequivocally eliminate from 
consideration any candidate population that does not 
possess the alleles. However, central to this
argument is the fact that without very large samples 
of individuals from each of the candidate 
populations, it is impossible to discriminate between 
alleles that are rare and alleles that are absent 
from any population. Thus, the approach presented 
herein is conservative in that it will underestimate 
the posterior probability of population of origin 
when there are population specific alleles. On the
other hand, our approach gains specificity in that
populations are not rejected from consideration 
simply because an allele was not present in the 
sample of individuals drawn from the population. 
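The effect of treating an unobserved allele as rare rather than absent can be sketched as below. This is a minimal illustration assuming a diploid individual whose two alleles at the locus are added to the candidate sample before relative frequencies are computed; the function name is hypothetical, and the counts (an allele absent from a sample of 50 alleles) are chosen to be consistent with the 2/52 term in the worked example later in this specification.

```python
from fractions import Fraction

def adjusted_frequency(sample_count, sample_total, copies_in_individual):
    """Relative allele frequency after adding the individual's alleles to the sample.

    A diploid individual contributes 2 alleles per locus, of which
    copies_in_individual (0, 1 or 2) are the allele in question.
    """
    return Fraction(sample_count + copies_in_individual, sample_total + 2)

# An allele absent from a candidate sample of 50 alleles, carried twice by the individual:
print(adjusted_frequency(0, 50, 2))  # 1/26, i.e. 2/52 rather than 0
```

Because the adjusted frequency is never zero for an allele the individual carries, no candidate population receives a genotype probability of exactly zero.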

2. The approach recognizes that there may well be a 
number of potential populations that were not 
selected as candidates because they were not sampled 
in order to define them as a candidate population. 
There may well be individuals that are submitted for 
testing that originated in some novel and unsampled 
population. These individuals will have posterior 
probabilities of population of origin estimated by 
the procedure and will be assigned to the candidate 
population that is genetically most similar to the 
true population of origin. In some cases, the 
posterior probability for one candidate population 
may be very high, even though the individual did not 
originate from this population. In order to identify 
misclassifications, we estimate the cumulative 
probability distribution function for genotypes that 
are rarer than that of the individual to be 
classified. This allows estimation of the
probability of a rarer genotype in the most likely
population of origin of the individual. If this 
probability is low, perhaps 5% or less, one should 
conclude that the individual actually originated in a 
population not included in the candidate set, since 
only 5% of genotypes are rarer in the most likely of 
the candidate populations. 
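The two-stage decision described in this item can be sketched as a small function. This is an illustrative sketch only: the function name and return labels are hypothetical, and the default thresholds (.90 for the posterior probability, 5% for the probability of a rarer genotype) are the example values used in this specification, which are user defined.

```python
def classify(posteriors, prob_rarer, posterior_threshold=0.90, rarity_threshold=0.05):
    """Assign an individual to a candidate population or leave it unassigned.

    posteriors: dict mapping population -> posterior probability.
    prob_rarer: probability of a rarer genotype in the most likely population.
    """
    best = max(posteriors, key=posteriors.get)
    if posteriors[best] <= posterior_threshold:
        return "unassigned"  # no candidate is supported strongly enough
    if prob_rarer <= rarity_threshold:
        return "unassigned"  # genotype too rare: a novel population is suspected
    return best

print(classify({"1": 0.0363, "2": 0.0005, "3": 0.9632}, prob_rarer=0.4054))
```

The second test only runs when the first has been passed, which mirrors the text: a high posterior probability is necessary but not sufficient for assignment.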

3. The power of the approach depends on the number 
of marker loci that are typed in the individuals to
be classified and the candidate populations and the
degree to which marker allele frequencies are skewed 
among the candidate populations. However, the 
approach is able to utilize all available 
information. If certain individuals have been 
genotyped for only a subset of the available markers, 
the candidate population genotype probabilities and 
therefore the posterior probabilities are computed 
only for the available multi-locus marker genotype. 
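Handling of partially genotyped individuals can be sketched as follows: loci without a genotype are skipped, and the same loci are therefore skipped for every candidate population. The sketch assumes Hardy-Weinberg proportions and independent loci; the function name, the use of None for an untyped locus, and the toy frequencies are hypothetical.

```python
def genotype_probability(genotype, freqs):
    """Genotype probability over the loci that were actually typed.

    genotype: dict mapping locus -> (allele, allele), or None if not typed.
    freqs: dict mapping locus -> dict of allele -> relative frequency.
    ASSUMPTION: Hardy-Weinberg proportions and independent loci.
    """
    prob = 1.0
    for locus, pair in genotype.items():
        if pair is None:
            continue  # untyped locus contributes no information
        a, b = pair
        p = freqs[locus]
        prob *= p[a] * p[b] * (1 if a == b else 2)
    return prob

# Toy frequencies, for illustration only
freqs = {"A": {"A1": 0.5, "A2": 0.5}, "B": {"B1": 0.2, "B2": 0.8}}
print(genotype_probability({"A": ("A1", "A1"), "B": None}, freqs))
```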

4. The probability thresholds for assigning an 
individual to a population of origin based upon the 
posterior probability and for accepting the 
individual as truly belonging to the most likely 
population based upon the probability of a rarer 
genotype must be determined empirically. Preferably 
a second independent sample from each of the 
candidate populations should be genotyped and 
posterior probabilities computed for each candidate 
population. By varying an artificial posterior 
probability threshold for accepting an individual as 
belonging to the most likely population of origin, we 
can empirically determine the percentage of 
individuals that a) cannot be classified to a 
population of origin, b) are correctly classified, 
and c) are incorrectly classified. Among those 
individuals that are incorrectly classified to a 
population based upon the posterior probability, 
altering the acceptance threshold for the percentage 
of rarer genotypes further allows the reduction in 
the overall percentage of misclassified individuals.
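The empirical calibration of the posterior probability threshold can be sketched as a sweep over a validation sample of known origin. This is a minimal sketch: the function name, the record layout, and the four validation records are hypothetical and serve only to show the bookkeeping.

```python
def sweep_posterior_threshold(validation, thresholds):
    """Tabulate outcome percentages for each candidate threshold.

    validation: list of (true_population, most_likely_population, posterior).
    Returns rows of (threshold, % unclassified, % correct, % incorrect).
    """
    n = len(validation)
    rows = []
    for t in thresholds:
        unclassified = sum(1 for _, _, post in validation if post <= t)
        correct = sum(
            1 for true, best, post in validation if post > t and best == true
        )
        incorrect = n - unclassified - correct
        rows.append((t, 100 * unclassified / n, 100 * correct / n, 100 * incorrect / n))
    return rows

# Hypothetical validation records, for illustration only
validation = [("P1", "P1", 0.99), ("P1", "P2", 0.95), ("P2", "P2", 0.80), ("P3", "P3", 0.60)]
for row in sweep_posterior_threshold(validation, [0.50, 0.90]):
    print(row)
```

The same tabulation could be repeated over the rarer-genotype threshold for the individuals that remain incorrectly classified.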

Example



Consider the following three candidate populations, 
which have been genotyped for two marker loci. The A 
locus has 2 alleles and the B locus 3 alleles if we 
ignore the subdivision into candidate populations. 
The individuals that were genotyped for the two loci 
from each of the candidate populations gave the 
following allele counts:

Suppose that the first individual presented for
classification to a population of origin has genotype 
$G_1 = A_1A_1B_3B_3$. The candidate population genotype
probabilities are:

\[ F_1(G_1) = \{22^2/42^2\}\{12^2/42^2\} = .0224, \]
\[ F_2(G_1) = \{22^2/52^2\}\{2^2/52^2\} = .0003, \]
\[ F_3(G_1) = \{52^2/62^2\}\{57^2/62^2\} = .5946. \]

We shall assume that this individual has an equal a
priori chance of originating from any of the three
populations and hence $P_{11} = P_{21} = P_{31} = 1/3$.
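The arithmetic of the three genotype probabilities can be checked with a few lines of Python. The (count, total) pairs are read directly from the expressions above; the variable and function names are hypothetical. Because the tested genotype is homozygous at both loci, each locus contributes the square of the allele frequency.

```python
# (count, total) for allele A1 and allele B3 in each candidate population,
# as they appear in the expressions for F_1, F_2 and F_3
terms = {
    1: ((22, 42), (12, 42)),
    2: ((22, 52), (2, 52)),
    3: ((52, 62), (57, 62)),
}

def homozygous_probability(locus_terms):
    """Product over loci of the squared allele frequency."""
    prob = 1.0
    for count, total in locus_terms:
        prob *= (count / total) ** 2
    return prob

F = {pop: homozygous_probability(t) for pop, t in terms.items()}
for pop, value in F.items():
    print(f"F_{pop}(G_1) = {value:.4f}")
```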



The posterior probabilities of belonging to each of
the candidate populations are:

\[ \Phi_{11} = \frac{0.33 \times 0.0224}{0.33 \times 0.0224 + 0.33 \times 0.0003 + 0.33 \times 0.5946} = .0363 \]

\[ \Phi_{21} = \frac{0.33 \times 0.0003}{0.33 \times 0.0224 + 0.33 \times 0.0003 + 0.33 \times 0.5946} \approx .0005 \]

\[ \Phi_{31} = \frac{0.33 \times 0.5946}{0.33 \times 0.0224 + 0.33 \times 0.0003 + 0.33 \times 0.5946} \approx .9632 \]

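The posterior calculation is an application of Bayes' rule and can be sketched as follows, using the rounded genotype probabilities .0224, .0003 and .5946 and equal priors of 1/3. The function name is hypothetical.

```python
def posterior(priors, likelihoods):
    """Posterior probability of each population given the genotype probabilities."""
    weighted = [p * f for p, f in zip(priors, likelihoods)]
    total = sum(weighted)
    return [w / total for w in weighted]

phi = posterior([1 / 3, 1 / 3, 1 / 3], [0.0224, 0.0003, 0.5946])
print([round(x, 4) for x in phi])
```

With equal priors the prior terms cancel, so each posterior probability reduces to the genotype probability divided by the sum of the three genotype probabilities.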
Since the magnitude of the posterior probability for
the third population exceeds a threshold (which we 
shall arbitrarily set at .90 for the purposes of this
example), we can conclude at this stage that the
individual originated either from the third candidate 
population or from a novel population genetically 
similar to the third candidate population. In order 
to discriminate between these two situations, we must 
compute the probability of a rarer genotype than 
$A_1A_1B_3B_3$ in the third candidate population. Since
there are only 18 possible genotypes for this 
example, we do not need to simulate the distribution 
of genotypes and provide the distribution:

The genotype distribution reveals that the $A_1A_1B_3B_3$
genotype is the most common genotype in the third 
candidate population and that fully 40.54% of 
individuals within this population have genotypes 
that are rarer than the genotype of the tested
individual. Therefore we conclude that the tested 
individual originated from the third candidate 
population and not from a novel population that was 
similar in genetic structure. 
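The exact enumeration for the third candidate population can be sketched as below, assuming Hardy-Weinberg proportions and independent loci. The frequencies 52/62 for A1 and 57/62 for B3 come from the expression for the third population's genotype probability, and 10/62 for A2 follows because the A locus has only two alleles. The split of the remaining 5/62 between B1 and B2 is NOT given in the surviving text and is a purely hypothetical choice here.

```python
from fractions import Fraction
from itertools import combinations_with_replacement, product

freq_A = {"A1": Fraction(52, 62), "A2": Fraction(10, 62)}
# ASSUMPTION: the B1/B2 split below is hypothetical; only B3 = 57/62 is from the text
freq_B = {"B1": Fraction(3, 62), "B2": Fraction(2, 62), "B3": Fraction(57, 62)}

def locus_genotypes(freq):
    """Unordered single-locus genotypes with Hardy-Weinberg frequencies."""
    out = {}
    for a, b in combinations_with_replacement(sorted(freq), 2):
        out[(a, b)] = freq[a] * freq[b] * (1 if a == b else 2)
    return out

# Two-locus distribution: 3 genotypes at A times 6 genotypes at B
dist = {
    ga + gb: pa * pb
    for (ga, pa), (gb, pb) in product(
        locus_genotypes(freq_A).items(), locus_genotypes(freq_B).items()
    )
}

tested = ("A1", "A1", "B3", "B3")
prob_rarer = sum(p for p in dist.values() if p < dist[tested])
print(len(dist), round(float(dist[tested]), 4), round(float(prob_rarer), 4))
```

The result does not depend on the hypothetical B1/B2 split: the tested genotype has a frequency above one half, so every other genotype is necessarily rarer and the probability of a rarer genotype is 1 - .5946 = .4054.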

Applications 

The approach disclosed herein has application, for
example, in the livestock industry, where there is a
need to be able to determine value differences in 
live animals due to the inherent genetic variation in 
the yield of tender and marbled beef from their 
carcasses. Packers are forced to sort through 
thousands of carcasses from animals slaughtered each 
day in order to identify those that meet the 
specifications of their customers. Due to the very 
high daily volume of slaughter animals and limited 
cooler space (which reduces ability to sort), packers
are unable to efficiently market their inventory 
based upon quality specifications. Further, packers
have no ability to identify, among the carcasses
that do not grade choice, those that could be marketed
as a tender product. By and large, the variation in
product specifications that the packers must manage 
each day correlates directly to the variation in the 
cattle received. 



Knowledge of an animal's underlying genetic 
predisposition to yield marbled and tender beef would 
allow the stratification of the existing commodity 
market to facilitate the management and marketing of 
animals based upon product specifications. As much 
as 50 percent of the variation in growth and carcass 
yield and quality attributes in cattle is determined 
by the additive effects of genes. The remaining 
variation is due to the environment that an animal is 
exposed to prior to entry to the feedlot and due to 
the management the animal receives during the feedlot 
and slaughter phases of production. Thus, up to 50
percent of the variation that currently exists within 
the commodity cattle market could be eliminated by 
grouping cattle according to their individual 
genotypes at entry into the feedlot. These animals 
could then be managed, fed and slaughtered as a 
uniform group and could then be marketed according to 
their quality attributes. This model would allow the 
creation of new "branded" products for the marketing 
of products such as lean and tender beef. 

The approach described in the present application can 
be applied not only to beef cattle but also to other 
livestock such as fish, pigs, chickens, lambs, 
shrimp, mussels, oysters, and shellfish.